
Non-record: Negative results — hardware alignment & quantization on 8xH100#670

Open

abaybektursun wants to merge 1 commit into openai:main from abaybektursun:negative-results-hardware-alignment

Conversation

@abaybektursun (Contributor) commented Mar 25, 2026

Summary

Key finding

The 82ms training step is 95%+ optimized. torch.compile (PyTorch 2.9.1) handles all fusion automatically. cuBLAS is at the hardware limit for K=512. The competition at d=512 on H100 is won by quantization quality (bits-per-parameter), not kernel engineering (FLOPS-per-second).


Kernel-Level Optimization (All Dead)

| Approach | Result | Why It Failed |
| --- | --- | --- |
| CUTLASS SM90 TMA+WGMMA GEMM | 2.5× slower than cuBLAS | cuBLAS heuristics beat default CUTLASS for 98304×512×1536. Built a working kernel: correct results, wrong speed. |
| Fused Triton GEMM + LeakyReLU² | 1.82× faster fwd, 2.7× slower fwd+bwd | `torch.autograd.Function` bypasses Inductor, so the backward runs in eager mode, 2-3× slower than Inductor's auto-generated Triton backward. |
| `torch.library.triton_op` for GEMM | Compile error | FakeTensor can't provide `data_ptr()`, so GEMM kernels are incompatible with `triton_op` tracing. |
| Custom CUDA C++ fused activation | 6% slower | PyTorch's `vectorized_elementwise_kernel` is already highly optimized for pointwise ops. |
| Fused norm+residual (Triton) | Ties torch.compile exactly | 0.136 ms ours vs 0.136 ms Inductor-generated; torch.compile already fuses this pattern. |
| FP8 training (TransformerEngine) | No speedup (90 vs 89 ms) | At d=512, attention GEMMs are already memory-bound (AI=170-255). FP8 doubles peak FLOPS but also doubles the ridge point, making more ops memory-bound. |
| QKV fusion (8Q/4KV GQA) | 3-17% slower | The fused (512→1024) GEMM is slightly faster, but splitting the output into non-contiguous Q(512)/K(256)/V(256) tensors costs more than the GEMM savings. |
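The FP8 row comes down to roofline arithmetic: an op is memory-bound when its arithmetic intensity (FLOPs per byte of traffic) falls below the hardware ridge point (peak FLOPS ÷ peak bandwidth). A minimal sketch, using approximate H100 SXM peaks (≈989 bf16 tensor-core TFLOPS, ≈3.35 TB/s HBM3) as assumptions; the big 98304×512×1536 GEMM from the table lands above the ridge (hence cuBLAS near its roofline), while FP8 doubles peak FLOPS and therefore doubles the ridge that the AI=170-255 attention GEMMs would have to clear:

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: float) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) GEMM, counting A, B, C traffic once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

# Approximate H100 SXM peaks (assumption; check the datasheet for your part).
PEAK_FLOPS_BF16 = 989e12   # dense bf16 tensor-core FLOP/s
PEAK_BW = 3.35e12          # HBM3 bytes/s

ridge_bf16 = PEAK_FLOPS_BF16 / PEAK_BW        # ~295 FLOPs/byte
ridge_fp8 = (2 * PEAK_FLOPS_BF16) / PEAK_BW   # FP8 doubles FLOPS -> ridge doubles

ai_bf16 = gemm_arithmetic_intensity(98304, 1536, 512, bytes_per_elem=2)  # ~382
ai_fp8 = gemm_arithmetic_intensity(98304, 1536, 512, bytes_per_elem=1)   # ~765

for name, ai, ridge in [("bf16", ai_bf16, ridge_bf16), ("fp8", ai_fp8, ridge_fp8)]:
    bound = "compute" if ai > ridge else "memory"
    print(f"{name}: AI={ai:.0f} FLOPs/byte, ridge={ridge:.0f} -> {bound}-bound")
```

An op at AI=255 sits below the bf16 ridge already; under FP8 its intensity at most doubles (to ~510) while the ridge doubles to ~590, so it stays memory-bound, which is the mechanism behind "FP8 makes more ops memory-bound."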

Conclusion: torch.compile (PyTorch 2.9.1) already fuses CE+softcap+tanh, LeakyReLU²+residual, RMSNorm+backward, and all pointwise chains. cuBLAS is at the hardware limit for K=512 (~48% of roofline, limited by pipeline depth). The 82 ms step is 95%+ optimized.
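The fusion targets above are all pointwise chains. As a reference for the kind of chain Inductor collapses into a single kernel, here is a NumPy sketch of the LeakyReLU²+residual pattern (assuming LeakyReLU² denotes the square of a leaky ReLU; the 0.01 slope is a placeholder, not the PR's value):

```python
import numpy as np

def leaky_relu_sq_residual(x: np.ndarray, residual: np.ndarray,
                           slope: float = 0.01) -> np.ndarray:
    """Pointwise chain: leaky_relu(x)**2 + residual.

    Eager PyTorch launches one kernel per op here (where, mul, add);
    torch.compile/Inductor emits one fused Triton kernel for the whole
    chain, so a hand-written fusion can at best tie it.
    """
    y = np.where(x > 0, x, slope * x)  # leaky ReLU
    y = y * y                          # square (note: this drops the sign)
    return y + residual                # residual add

x = np.array([-2.0, 0.5, 3.0])
r = np.ones(3)
print(leaky_relu_sq_residual(x, r))  # values: 1.0004, 1.25, 10.0
```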

torch.compile Gotchas

| Issue | Impact | Mechanism |
| --- | --- | --- |
| Late QAT recompilation | OOM with larger models | Flipping `_qat_enabled` mid-training changes the forward graph → torch.compile recompiles → memory spike exceeds 80 GB. |
| `torch.autograd.Function` | 2-3× slower backward | Custom Functions bypass Inductor entirely; the backward runs as uncompiled eager Python ops. |
| H100 memory compression | 25-50% inflated benchmarks | Synthetic data (cudaMemset, BlockFillRandom, zeros) compresses in HBM hardware. Only `torch.randn` gives real numbers. |
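The memory-compression gotcha generalizes: buffers of zeros or repeated blocks are highly compressible, so any compressed memory path moves far fewer real bytes than the benchmark assumes. HBM compression is transparent hardware, but the same effect can be illustrated with ordinary software compression (the fill patterns below are stand-ins, not the benchmark's actual buffers):

```python
import os
import zlib

n = 1 << 20  # 1 MiB per buffer

zeros = bytes(n)                          # like cudaMemset / torch.zeros
pattern = bytes(range(256)) * (n // 256)  # stand-in for a repeated synthetic block
noise = os.urandom(n)                     # like torch.randn: incompressible

for name, buf in [("zeros", zeros), ("pattern", pattern), ("random", noise)]:
    ratio = len(zlib.compress(buf)) / len(buf)
    print(f"{name:7s} compresses to {ratio:.1%} of original size")
```

Zeros and repeated patterns shrink to well under 1% of their size, while random bytes do not compress at all; a bandwidth benchmark fed the first two kinds of buffer overstates effective throughput accordingly.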

Quantization Experiments (Diminishing Returns)

| Approach | BPB | Delta | Why It Failed |
| --- | --- | --- | --- |
| SpinQuant (Hadamard rotation before GPTQ) | 1.1151 | −0.0002 | GPTQ's actorder + Cholesky already handles outliers; rotation adds little on top. The artifact is slightly larger (rotated weights compress worse). |
| Mixed-precision int5/int8 per-layer | 1.1209 | +0.006 | int5 (31 levels) is too coarse; boundary layers at int8 can't compensate for middle layers losing half their precision. |
| Soft-Round QAT (differentiable rounding) | 1.1151 | −0.0002 | Identical to standard STE: the ~500 QAT steps aren't enough for the temperature annealing to take effect. |
| Selective ±1 pruning at 28-37% | 1.1198-1.1204 | +0.004-0.005 | Too aggressive; only <10% pruning is loss-neutral. |
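The Soft-Round and mixed int5/int8 rows both reduce to the same fake-quantization primitive: scale, round onto a symmetric b-bit integer grid, clamp, rescale. A NumPy sketch (symmetric per-tensor scaling is an assumption for illustration; the PR's actual GPTQ/QAT pipeline is more involved):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Round w to a symmetric b-bit integer grid and back to float.

    bits=5 gives qmax=15, i.e. 31 levels in [-15, 15]; bits=8 gives 255.
    Training through this op uses the straight-through estimator (STE):
    the forward applies round(), the backward treats round() as identity.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.array([-1.0, -0.3, 0.02, 0.7, 1.0])
print(fake_quantize(w, bits=5))  # step 1/15: visibly coarse
print(fake_quantize(w, bits=8))  # step 1/127: much finer grid
```

With only 31 levels, every weight lands on a grid with step max|w|/15, which makes the "int5 is too coarse" outcome concrete: the worst-case per-weight error is half that step, versus 1/127 of the range at int8.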

Architecture & Training (All Negative)

| Approach | BPB | Delta | Why It Failed |
| --- | --- | --- | --- |
| XSA on all 11 layers (vs last 4) | worse at 100s | +0.014 | 2.9 ms/step overhead; in our Parallel Muon stack, the slower step costs more than XSA gains. |
| Value Residual Learning | 1.1179 | +0.0008 | VRL conflicts with VE128: both inject identity information into deep attention layers, so it is redundant. |
| Gated Attention | 1.1197 | +0.0026 | 4% slower step time; per-head sigmoid gates add overhead not compensated by quality. |
| Weight decay 0.08 (vs 0.04) | 1.1235 | +0.008 | Better at 100s, worse at 600s: over-regularization prevents learning during warmdown. Early loss does not predict final post-quant BPB. |
| Batch size 1M tokens | 1.1197 | +0.003 | Fewer steps (5,526 vs 7,189) hurt more than better gradients help. |
| Train bigger d=576 + int5 | 1.1233 | +0.006 | 110 ms/step means 24% fewer steps; the scaling-law gain can't compensate. |
| Shard ordering (hard→easy) | 1.1162 | +0.0009 | Per-shard loss spread is only 0.3%; reordering disrupts natural diversity. |
| Legal TTT (22 experiments) | 1.1177 best | +0.0006 | The score-first constraint means the model adapts too late. |
| Hessian all-reduce across GPUs | 1.1169 | −0.0002 | 256 batches per GPU are already sufficient. |

Meta-Lessons

  1. The step is 95%+ optimized. torch.compile handles all fusion, cuBLAS is at the hardware limit, and FA3 is already in use.
  2. H100 is massively overprovisioned for this model: only 21.5 GB of the 80 GB is used, and NVLink is 99% idle.
  3. The competition is bits-per-parameter, not FLOPS-per-second. The quantization gap (0.022 BPB) is 10× larger than any kernel optimization.
  4. Stale processes from nohup+torchrun accumulate silently, causing 2-3× performance degradation.
  5. Early training loss doesn't predict final BPB. Fast A/B tests filter bad ideas but can't confirm good ones.
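Meta-lesson 4 is operational: a `nohup`-launched `torchrun` that dies partway through can leave worker processes pinning GPU memory and SMs, silently slowing the next run. A minimal pre-flight check, sketched here (the `torchrun` match pattern and the kill step are assumptions about the local setup, not part of the PR):

```shell
# Count leftover torchrun launchers before starting a new run.
# pgrep -f matches full command lines; -c prints a count (exit 1 if zero).
stale=$(pgrep -fc "torchrun" || true)
echo "stale torchrun processes: ${stale:-0}"

# If the count is nonzero, clear them before benchmarking, e.g.:
#   pkill -f torchrun
# then confirm with nvidia-smi that GPU memory was actually freed.
```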

Test plan

🤖 Generated with Claude Code

…xH100

30+ experiments on the PR openai#593 stack (1.1171 BPB), all negative or marginal:
- CUTLASS SM90 GEMM: 2.5x slower than cuBLAS
- Fused Triton GEMM+activation: autograd.Function kills backward
- FP8, QKV fusion, custom CUDA: all slower or no improvement
- SpinQuant, mixed int5/int8, Soft-Round QAT: noise-level
- XSA-all, VRL, Gated Attention, bigger model, shard ordering: all worse
- 22 legal TTT experiments: all worse than non-TTT baseline

Key finding: 82ms step is 95%+ optimized. torch.compile handles all fusion.
Competition at d=512 is bits-per-parameter, not FLOPS-per-second.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>